{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "z-6LLOPZouLg"
   },
   "source": [
    "# LightEvalを使用してLLMを評価および分析する方法\n",
    "\n",
    "このノートブックでは、LightEvalを使用して大規模言語モデル（LLM）を評価および分析する方法を示します。LightEvalは、さまざまなタスクにわたってモデルの性能を評価するための柔軟で使いやすいフレームワークです。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fXqd9BXgouLi"
   },
   "source": [
    "## 1. 開発環境のセットアップ\n",
    "\n",
    "最初のステップは、必要なライブラリをインストールすることです。`lighteval`、`transformers`、`datasets`を含みます。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tKvGVxImouLi"
   },
   "outputs": [],
   "source": [
    "# Google Colabでの要件のインストール\n",
    "# !pip install lighteval transformers datasets\n",
    "\n",
    "# Hugging Faceへの認証\n",
    "\n",
    "from huggingface_hub import login\n",
    "\n",
    "login()\n",
    "\n",
    "# 便利のため、Hugging Faceのトークンを環境変数HF_TOKENとして設定できます"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XHUzfwpKouLk"
   },
   "source": [
    "## 2. モデルとデータセットの読み込み"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "id": "z4p6Bvo7ouLk"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 28.0kB/s]\n",
       "Downloading (…)lve/main/config.json: 100%|██████████| 1.11k/1.11k [00:00<00:00, 1.11MB/s]\n",
       "Downloading (…)olve/main/vocab.json: 100%|██████████| 1.06M/1.06M [00:00<00:00, 1.06MB/s]\n",
       "Downloading (…)olve/main/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 525kB/s]\n",
       "Downloading (…)/main/tokenizer.json: 100%|██████████| 2.13M/2.13M [00:00<00:00, 2.13MB/s]\n",
       "Downloading (…)/main/adapter_config.json: 100%|██████████| 472/472 [00:00<00:00, 472kB/s]\n",
       "Downloading (…)/adapter_model.bin: 100%|██████████| 1.22G/1.22G [00:10<00:00, 120MB/s]\n",
       "Downloading pytorch_model.bin: 100%|██████████| 1.22G/1.22G [00:10<00:00, 120MB/s]\n"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# モデルとデータセットの読み込み\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "from datasets import load_dataset\n",
    "\n",
    "model_name = \"HuggingFaceTB/SmolLM2-135M\"\n",
    "model = AutoModelForCausalLM.from_pretrained(model_name)\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "\n",
    "# サンプルデータセットの読み込み\n",
    "dataset = load_dataset(\"lighteval\", \"mmlu\")\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9TOhJdtsouLk"
   },
   "source": [
    "## 3. LightEvalを使用した評価\n",
    "\n",
    "LightEvalを使用してモデルを評価するために、評価タスクを定義し、パイプラインを設定します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightEvalのタスクとパイプラインをインポート\n",
    "from lighteval.tasks import Task, Pipeline\n",
    "from lighteval.metrics import EvaluationTracker\n",
    "\n",
    "# 評価するタスクを定義\n",
    "tasks = [\n",
    "    \"mmlu|anatomy|0|0\",\n",
    "    \"mmlu|high_school_biology|0|0\",\n",
    "    \"mmlu|high_school_chemistry|0|0\",\n",
    "    \"mmlu|professional_medicine|0|0\"\n",
    "]\n",
    "\n",
    "# パイプラインパラメータを設定\n",
    "pipeline_params = {\n",
    "    \"max_samples\": 40,  # 評価するサンプル数\n",
    "    \"batch_size\": 1,    # 推論のバッチサイズ\n",
    "    \"num_workers\": 4    # ワーカープロセスの数\n",
    "}\n",
    "\n",
    "# 評価トラッカーを作成\n",
    "evaluation_tracker = EvaluationTracker(\n",
    "    output_path=\"./results\",\n",
    "    save_generations=True\n",
    ")\n",
    "\n",
    "# パイプラインを作成\n",
    "pipeline = Pipeline(\n",
    "    tasks=tasks,\n",
    "    pipeline_parameters=pipeline_params,\n",
    "    evaluation_tracker=evaluation_tracker,\n",
    "    model=model\n",
    ")\n",
    "\n",
    "# 評価を実行\n",
    "pipeline.evaluate()\n",
    "\n",
    "# 結果を取得して表示\n",
    "results = pipeline.get_results()\n",
    "pipeline.show_results()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-yO6E9quouLl"
   },
   "source": [
    "## 4. 結果の分析\n",
    "\n",
    "評価結果を分析し、モデルの性能を理解します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandasを使用して結果を分析\n",
    "import pandas as pd\n",
    "\n",
    "# 結果をDataFrameに変換\n",
    "df = pd.DataFrame(results)\n",
    "\n",
    "# 結果を表示\n",
    "print(df)\n",
    "\n",
    "# 結果を視覚化\n",
    "df.plot(kind=\"bar\", x=\"Task\", y=\"Value\", legend=False)\n",
    "plt.ylabel(\"Accuracy\")\n",
    "plt.title(\"Model Performance on Different Tasks\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
